Improved chromosome-level genome assembly of the Glanville fritillary butterfly (Melitaea cinxia) integrating Pacific Biosciences long reads and a high-density linkage map

Abstract Background The Glanville fritillary (Melitaea cinxia) butterfly is a model system for metapopulation dynamics research in fragmented landscapes. Here, we provide a chromosome-level assembly of the butterfly's genome produced from Pacific Biosciences sequencing of a pool of males, combined with a linkage map from population crosses. Results The final assembly size of 484 Mb is an increase of 94 Mb on the previously published genome. Estimation of the completeness of the genome with BUSCO indicates that the genome contains 92–94% of the BUSCO genes in complete and single copies. We predicted 14,810 genes using the MAKER pipeline and manually curated 1,232 of these gene models. Conclusions The genome and its annotated gene models are a valuable resource for future comparative genomics, molecular biology, transcriptome, and genetics studies on this species.


Context
Identifying and characterizing genes underlying ecologically and evolutionarily relevant phenotypes in natural populations has become possible with novel genomic tools that can also be used in "non-model" organisms. The Glanville fritillary (Melitaea cinxia, NCBI:txid113334) butterfly, and in particular its metapopulation in theÅland Islands (southwest Finland), is an ecological model system in spatial ecology [1,2]. InÅland, the species inhabits a network of dry outcrop meadows and pastures and per-sists as a classic metapopulation with high turnover in patch occupancy [1]. The network of 4,500 potential habitat patches has been systematically surveyed bi-annually for butterfly occupancy and abundance since 1993 [3], providing a vast amount of ecological data on population dynamics [2]. Experimental manipulations under more controlled conditions are also possible owing to the small size, high fecundity, and relatively short generation time of the species. Consequently, our understanding of the species includes knowledge of life history variation across development stages [4,5], dispersal dynamics [6,7], species interactions with host plants and parasitoids [8][9][10][11][12], and stress tolerance [13,14]. During the past decade, the system has also been used to study genetic and evolutionary processes, such as identifying candidate genes underlying variation and evolution of dispersal in fragmented habitats [15] and host plant preference [16], and assessing allelic variation and its dynamics in space and time [17][18][19]. Several approaches have been used to explore the genetic underpinnings of phenotypic variation in the Glanville fritillary metapopulation, ranging from candidate gene approaches [13,20] and quantitative genetics [21,22] to wholegenome scans [23,24], under both laboratory and natural environmental conditions.
The first M. cinxia genome assembly was released in 2014 [25]. This genome was produced from a combination of 454 sequencing for contig assembly, followed by scaffolding with Illumina paired-end (PE), SOLiD mate-pair reads and Pacific Biosciences (PacBio) data. The size of the final assembly was 390 Mb made up from 8,261 scaffolds, with a scaffold N50 of 119,328. Scaffolds were assigned to chromosomes on the basis of a linkage map produced from RAD sequencing [25]. We recently assessed the actual genome size using a k-mer-based approach on Illumina sequencing data and obtained estimates ranging from 488 to 494 Mb (Supplementary File S5). It was considered that a new genome, sequenced using PacBio long reads, would result in a more complete assembly and better represent the repetitive areas of the genome.
Here, a new sequencing and assembly of the M. cinxia genome has been carried out using a pool of 7 male butterflies from a single larval family collected from Sottunga, an island in an eastern part of the archipelago. Sequencing was conducted using the PacBio RSII sequencer. An initial assembly was created using FALCON [27,28] followed by polishing performed with Quiver [27]. A new linkage map was created and used to assign the assembled scaffolds to their correct positions and orientations within the 31 chromosomes. The scaffolds were then gap-filled, producing a final assembly of 484 Mb with a scaffold N50 of 17,331,753 bp. The obtained genome size is well in line with the k-mer estimates. Gene prediction on the genome assembly was carried out using MAKER v 2.31.10 [29], which was run iteratively using several independent training sets. Manual annotation was performed for 1,232 of the gene models. The genome assembly increased greatly in contiguity and completeness compared to the first genome (Table 1), with chromosomal superscaffold N50 values of 17,331,753 bp in the new genome compared to 119,328 bp in the Version 1 genome.
The significant increase in assembly size warrants a further investigation of the composition of these added sequences. Initial observations of individual alignments from genome-togenome alignment show many collapsed repeat regions in the Version 1 genome that are mapped to multiple chromosomes in Version 2.

Methods
An overview of the processing pipeline for the work is shown in Fig. 1.

Genomic samples and DNA extraction
Owing to the facultatively univoltine life cycle of the butterfly in Finland, experimental inbreeding of the species would have taken several years. Therefore, we chose to sample individuals from an island population, Sottunga, expected to harbour lower genetic diversity compared to less isolated populations. Sottunga is part of theÅland Islands archipelago in the northern Baltic Sea, and the population was introduced here in 1991 using individuals collected on the mainland ofÅland Island [32]. This introduction was carried out with 71 larval families. The distance to the nearest M. cinxia population across the water is 5 km, and we therefore assume that the introduced population has remained (almost) completely isolated. Furthermore, the effective population size of M. cinxia in Sottunga has been very low during the past 24 years (on average 57 larval nests/year in 1993-2019), and it has experienced several strong bottlenecks [33]. Using genomic markers, Fountain et al. [17] 60 8.1768 N 20 40.1214 E). The larvae were collected once they were in diapause and most likely comprise full siblings [18]. The larval group was kept in diapause (+5 • C) until the following spring and then reared to adulthood under common garden conditions (28:8 • C; 12L:12D) at the Lammi Biological Station, University of Helsinki. After eclosion, butterflies were sexed and stored at −80 • C. High molecular weight DNA was isolated from 7 adult males using the caesium chloride method [25]. Several individuals were used to obtain enough starting material for constructing the Single Molecule, Real-Time (SMRT) sequencing library.

SMRT sequencing libraries and sequencing
Library construction for PacBio sequencing (PacBio RS II Sequencing System, RRID:SCR 017988) was carried out using the protocols recommended by the manufacturer (Pacific Biosciences, Menlo Park, CA, USA). Genomic DNA was sheared using a Megaruptor (Diagenode, Seraing, Belgium) followed by damage repair, end repair, hairpin ligation, and size selection using BluePippin (Sage Science, Beverly, MA, USA; RRID:SCR 020 505). After primer annealing and polymerase binding, the DNA templates were sequenced on a PacBio RSII sequencer using P6/C4 chemistry and 360 min video time at the DNA Sequencing and Genomics Laboratory, Institute of Biotechnology, University of Helsinki, Finland [34].

Genome Assembly
The genome was assembled using the FALCON assembler (FALCON-Integrate-1.8.6) [26, 27] with a read length cut-off of 18,000 bp. This cut-off was found to give the best contiguity for the assembly based on N50 value, while minimizing the percentage of possibly erroneous contigs. The erroneous contigs were detected by mapping markers of the linkage map from the previously published genome [25] to contigs and calculating the percentage of chimeric contigs. We tested 3 different read length cut-offs, 16,000, 18,000, and 20,000 bp, all of which included ∼9% of chimeric contigs. The assembly was based on 1.9M PacBio reads, 24.4 Gb in total, with an N50 of 18,479 bp, which is ∼50× coverage based on the final genome size. With the selected read cut-off the data produced 10.8 Gb of corrected reads that were further assembled using the FALCON software (Falcon, RRID:SCR 016089). The assembly yielded 4,559 primary contigs containing 739.9 Mb with an N50 of 340 kb and 1,661 alternative contigs containing 118.1 Mb with an N50 of 85,246 bp. The alternative contigs were automatically separated by the FALCON pipeline. The data were also assembled using miniasm software (0.2-r137-dirty) [35], which yielded similar results. The larger than expected initial assembly size, ∼1.5 times the k-mer estimate, is due to the multiple haplotypes originating from the 7 individuals used in sequencing.
To evaluate the putative chimeric contigs and assembly errors suggested by the genetic map, the raw SMRT sequencing data were mapped to the assembly primary contigs using BWA (BWA-0.7.17, RRID:SCR 010910) with the MEM algorithm [36]. The alignments of the 425 regions discovered as possibly chimeric were visually inspected. Of these regions, 92 showed even read coverage and no evident signs of assembly errors, while 333 regions contained areas with low coverage and/or repeat regions indicated by high coverage that had led to erroneous overlaps and misassemblies. These errors were identified by positions where the majority of the reads did not fully align; i.e., the alignments ended mid-read. The assembly was split in the positions where the coverage was at minimum. The resulting assembly was polished using the SMRT sequencing data and Quiver [26] software from the SMRT Tools-package (PacBio).

Linkage Map
Linkage mapping was constructed from whole-genome resequencing data of F2 crosses of M. cinxia. The grandparents of these F2 crosses are offspring of wild-collected M. cinxia originating from 2 distantly related M. cinxia populations around the Baltic Sea: theÅland Islands (ÅL) [1] and Pieni Tytärsaari (PT)  populations [37]. Between-population crosses of typeÅL♂×PT♀ andÅL♀×PT♂ were established to create the F1 population. Some of these F1 individuals were used to establish the F2 families, actively avoiding mating among siblings. A subset of the resulting full-sibling families were reared to adulthood, and 5 of these F2 families, together with their parents and grandparents, were selected for resequencing. In total, resequencing included 10 grandparental individuals, 10 F1 parents, and 165 F2 individuals (N: 185). All the larvae from different generations completed development under common garden conditions (28:15 • C; 12L:12D) utilizing fresh leaves of greenhouse-grown Veronica spicata. Diapausing larvae were kept in a growth chamber at +5 • C and 80% relative humidity for ∼7 months to mimic the normal wintertime conditions for these butterflies. Adults were kept in hanging cages (of 50 cm height and 40 cm diameter) at ∼26:18 • C; 9L:15D and fed ad libitum with 20% honey-water solution throughout the experiments.
Before DNA extraction the adult butterflies were stored at −80 • C, and either thorax or abdomen tissue of these individuals was used for sequencing. Tissues were homogenized prior to extraction using TissueLyser (Qiagen, Venlo, The Netherlands) at 30/s for 1.5 mins with Tungsten Carbide Beads, 3 mm (Qi-agen, Venlo, The Netherlands), and ATL buffer (Qiagen, Venlo, The Netherlands). DNA was extracted using the NucleoSpin 96 Tissue Core Kit (Macherey-Nage, Düren, Germany) according to the manufacturer's protocol with the exception that lysing time was extended to overnight. The samples were additionally treated with RNase A (Thermo Fisher Scientific, Waltham, MA. USA) before sequencing. Sequencing was performed using standard PE library preparation and Illumina HiSeq 2000 (Illumina HiSeq2000, RRID:SCR 020132) with 125 bp PE reads.
SeparateChromosomes2 was run on the final data (parameters lodLimit: 20; samplePair: 0.2;numThreads: 48) to obtain 31 linkage groups with a total of 2.4M markers. OrderMarkers2 was run (parameter recombination2: 0) on each linkage group (chromosome). This map was used to anchor the contig assembly into chromosomes. To validate anchoring, the map construction was repeated in the same way except that OrderMarkers2 was run on the physical order of markers to reduce noise in the linkage map. Finally, the raw data were remapped to the gap-filled chromosome-level assembly and the linkage map was redone in the new physical order to infer final recombination rates.

Anchoring the genome and resolving haplotypes using the linkage map
The contigs were aligned against each other and lift-over chains were created by running the first 2 steps (batch A and B to calculate the alignment chain) of the HaploMerger2 [40] pipeline. By manually inspecting this chain (all.chain.gz), contigs fully contained in some longer contig were removed. Initial contig order and orientation within each chromosome was calculated by the median map position of each contig and the longest increasing subsequence of markers, respectively. For each chromosome, Marey map [41], a scatter plot of physical and linkage positions combining the genetic and physical maps, and contigcontig alignments from the chain were recorded. The contigs' orders and orientations were manually fixed when needed if the map had support for alternative orientation. If the contigcontig alignments linked contigs together, they were joined. Any assembly errors that were found were corrected by splitting the contigs accordingly. Also, partially haplotypic contigs were found and collapsed, i.e., alternative haplotype sequence removed, on the basis of the Marey maps and contig-contig alignments. This manual work facilitated the removal of additional haplotype contigs and regions and resulted in the haploid reference genome sequence including start and end positions of contigs in the correct order and orientation for each chromosome. Of 2,933 contigs in initial reference, 4 were chimeric and were split to 9 separate contigs. Of the resulting 2,938 contigs, 1,080 were included without any modification, 825 were trimmed on 1 or both ends, and 1,033 were completely contained and thus removed. Finally, the haplotype-corrected genome was gap-filled using PBJelly software (PBSuite 15.8.24; RRID:SCR 0 12091) [42] with the original SMRT sequencing data and pol-ished with the Quiver tool [26] from the SMRT Tools-package 2.3.0 (PacBio) and with Pilon (1.21) (Pilon, RRID:SCR 014731) [43], which resulted in the final reference genome sequence of ∼484 Mb.
The chromosomes were aligned against the Heliconius melpomene (2.5) [44,45] and Pieris napi [46] genomes using the LAST aligner(938) [47] to check structural similarity between the species (Supplementary Figs S1-S13). An overview alignment for H. melpomene was created using D-GENIES (1.2.0) (D-GENIES, RRID:SCR 018967) [48] (Fig. 2). The data show a high level of collinearity between M. cinxia and H. melpomene chromosomes, as described before in Ahola et al. [25]. An interesting point is the lack of collinearity with sex chromosomes (M. cinxia chromosome 1 and H. melpomene chromosome 21). Furthermore, the visible vertical lines show the effect of long-read assembly on repeat resolution. With long reads spanning the repeats and allowing their accurate placement in the contigs, in M. cinxia the repeats are placed in single chromosomes whereas in H. melpomene they are present in all chromosomes.

Transcriptome assembly
To aid construction of gene models, we capitalized on 2 transcriptome assemblies that were produced as part of separate projects in our laboratory to be presented in upcoming publications ( [5], PRJNA670126). Importantly for gene model construction, they represent a wide range of transcriptional diversity, as the RNA-seq data are derived from various developmental stages (first instar larvae, fourth instar larvae, and adult thorax and abdomen). All individuals were lab-reared but originated from the same butterfly metapopulation. Transcriptome 1 was produced using a set of 78 individually sequenced female larvae (fourth developmental instar) [5], sequenced to an average depth of 17.3M reads (read lengths 85 and 65 bp for forward and reverse PE reads, respectively). Because the 2 sexes are practically indistinguishable in the larval stages, the females were identified on the basis of homozygosity across a set of 22 Z-chromosomespecific single-nucleotide polymorphism loci [5]. To remove Illumina adapter sequences, we trimmed raw reads using Trimmomatic (Trimmomatic-0.35, RRID:SCR 011848) [56], and normalized using Trinity v2.6.5 (Trinity, RRID:SCR 013048) [57]. We then used 2 separate procedures to construct de novo transcriptome assemblies, Trinity (v2.6.5) and Velvet/Oases (1.2.10) [58]. Trinity was run with standard settings, whereas Velvet/Oases used a range of 7 k-mer sizes (21-71 bp), producing a separate assembly for each k-mer size. We then combined the resulting assemblies, filtered the combined assembly using the EvidentialGene (tr2aacds.pl VERSION 2017.12.21) [59] pipeline, and removed contigs smaller than 200 bp or expressed at a low level (<1 normalized counts per million), yielding the final assembly. Transcriptome 2 was constructed from a set of 12 adult females (thorax and abdomen, without ovaries) and 48 first instar larvae, as part of a separate gene expression study (PRJNA670126). RNA from these 60 individual samples was sequenced to an average depth of 16.6M reads (86/74 bp PE). The stranded RNA-seq libraries were made using Ovation R Universal RNA-Seq System (Nugen) with custom ribosomal RNA removal. The libraries were PE sequenced on a NextSeq 500 using the 150 bp kit (Illumina) at the DNA sequencing and genomics laboratory Institute of Biotechnology University of Helsinki. We trimmed the reads using fastp (v0.20.0) [60], and used the HISAT2 2.0.4 (HISAT2, RRID:SCR 015 530)/StringTie 1.3.5 (StringTie, RRID:SCR 016323) pipeline [61] to construct a genome-guided transcriptome assembly, mapping the RNA-seq reads to the new genome assembly. Transcriptome 1 yielded 69,182 putative transcripts with mean length of 727 bp (95% CI: 206-3,433), while Transcriptome 2 yielded 137,250 putative transcripts with mean length of 1,737 (95% CI: 203-9,106). These statistics should be interpreted with caution because the assemblies derive from different life stages, and different assembly and filtering approaches were used (reflecting differences in histories of the datasets as they were produced for different projects).

Gene model annotation
Initial gene predictions were obtained by running the MAKER v 2.31.10 [29] gene prediction program in an iterative procedure. In the first round of MAKER, transcriptome assembly 1, described above, was provided as evidence, and genes were predicted solely from the aligned transcripts. This resulted in 14,738 gene models. These gene models were then used for training the SNAP (2013-02-16) [62] and AUGUSTUS (3.3.2) (Augustus, RRID: SCR 008417) [63] gene predictors. A second round of MAKER was run providing the de novo transcripts from both transcriptomes (see previous paragraph), trained gene prediction models, repeat masking file, and protein data from other lepidopteran species. The MAKER settings were adjusted to allow prediction of gene models without requiring a corresponding transcript in the de novo transcriptome assembly. Following each round of MAKER gene prediction, the annotation completeness was assessed using BUSCO (Benchmarking Universal Single-Copy Orthologs, RRID:SCR 015008) [64,65].

Manual annotation
Manual annotation was performed for 1,232 genes, using the Apollo collaborative annotation system Version 2.1.0 [66]. The collaborative annotation environment was set up in Ubuntu Linux 14.04 server with 250 GB RAM and 48 AMD Opteron 6 168 processing cores. This was later upgraded to a cloud server provided by the Finnish IT Center for Science (CSC) and run on Ubuntu Linux 18.04 with 200 GB RAM and 40 Intel Xeon model 85 processing cores. Evidence tracks were produced containing gene predictions from 3 rounds of MAKER, RNA-seq alignments of sequence reads, and protein alignments from other species (Table 2). RNA-seq alignments comprised a mixedtissue pooled sample, an abdomen pooled sample, and 6 larval samples (from Transcriptome 1) selected to represent a diverse range and included, e.g., both sexes and different family backgrounds. A list of gene families that were considered of particular interest in butterfly research were identified for prioritization during the manual annotation. (Supplementary File S4). The gene annotators were able to select a family of genes for annotation, or a random selection from the prioritized families was given. Gene models were corrected by examining the evidence tracks in the browser, conducting blast searches, and examining multiple alignments of protein sequences. In total for the 1,232 genes, 1,455 messenger RNAs (mRNAs) were manually inspected, of which 814 genes and mR-NAs were changed. Most changes were made to exon borders and mRNA exon structure, especially in the case of multiple isoforms.

Final gene models
Following the manual annotation, the SNAP [62] and AUGUS-TUS [63] gene predictors were retrained using the manually annotated gene models. MAKER was run using the updated gene predictors, Transcriptome 1 and 2, and using a masking file for repeats. As a final step to incorporate the manually annotated gene models, MAKER was run, providing the previous MAKER file to pred gff and the manually annotated models to model gff. Gene functional prediction was performed using Pannzer v2 [68].

Data Validation and Quality Control
To assess the quality of the assembly, assembly statistics were generated using assembly-stats [30] and compared to the v1 genome, as well as the H. melpomene, B. mori, and P. napi genome assemblies ( Table 1). The new genome contains 94 Mb more sequence than the previous scaffold assembly. On the basis of the observations of individual alignments in the full genome alignment between Version 1 and Version 2, there are many regions in Genome 1 that are aligned into multiple positions in Version 2. This points to collapsed repeat regions in Version 1 and more accurate repeat placement due to the long-read sequencing in Version 2. The N50 length and L50 value at scaffold or chromosome level improved greatly compared to the previous genome. To check for possible duplication or missing areas in the assembly, an assessment was made for the completeness of single-copy orthologs from BUSCO [64,65] eukaryota, arthropoda, and metazoa gene sets (Table 3). In each of the gene sets, 93.0-94.9% of the expected single-copy orthologs were found in complete copies. The duplication rate was estimated to be between 1.4 and 1.5%. A total of 1,232 gene models were manually curated using the Apollo annotation system [66] to ensure the quality of the models. To test for contamination, the predicted protein sequences were checked with AAI-profiler [73] to identify sequences originating from different taxa (Supplementary Files S1-S3). Overall, 42% of the genome was composed of repeat sequences ( Fig. 4 and Supplementary Figs S15-S20 [chromosome-specific repeat classes]). There were no clear differences in the repeat contents between chromosomes (Supplementary Table S1), which further supports the more accurate placement of repeats due to the long-read sequencing in Version 2. Long interspersed nuclear elements (LINEs) were the most prevalent.

Reuse potential
The substantial improvements in contiguity and gene annotation quality of the new genome will enable a range of important new studies and open up possibilities for future work. The results also demonstrate that with the use of proper computational tools and data, it is possible to obtain a high-quality, chromosome-scale reference genome even when a single individual organism will not provide enough high molecular weight DNA for long-read sequencing. Furthermore, we show the potential of the linkage mapping: it anchors contigs to actual chromosomes instead of just linking different contigs together as is done, for example, in the Hi-C approach. Moreover, the haplotype problem is not tackled by Hi-C. Our high-density linkage map allows us to put nearly all contigs into chromosomes. It is worth noting that the linkage map is not scaffolding directly but it puts contigs into map positions; scaffolding is possible if a contig spans 2 or more map positions. Otherwise, the contig can be placed only partially. In addition to the linkage map approach, we used extensive manual curation of the assembly to avoid chimeric parts and improve the assembly quality. Current research aims at identifying mechanisms underlying key life history adaptations, exploring the extent of natural variation and selection on these adaptations in wild populations, and integrating these insights with the exceptional ecological, demographic, and climatic data available for this system. Future studies in this direction will help identify the mechanisms maintaining variation in life histories across spatial and temporal scales, and the extent to which phenotypic variation in these and other traits may contribute to a population's adaptive capacity under climate change. Several studies in different species illustrate how stress responses can be crucial for survival under variable environments, both within and between generations. The Glanville fritillary is being used to explore how environmental information is translated into adaptive phenotypic  changes, and how these responses are transmitted to future generations, using transcriptomic and epigenetic approaches. Such studies will benefit from an improved annotation permitting exon-specific expression quantification, and identification of epigenetic marks and other functional variants outside coding regions. Exploiting current and past large-scale sampling efforts, these new studies apply population genomic approaches that are facilitated by the increased assembly contiguity, e.g., by permitting linkage disequilibrium and haplotype-based selection analyses. Other avenues of research enabled by the improved genome assembly include structural variation, regulatory evolution, recombination rate variation, and coalescent-based demographic analyses. The increasing availability of chromosomelevel lepidopteran genomes such as ours permits exciting new comparative phylogenetic analyses, e.g., of chromosome and genome evolution.

Data Availability
The SMRT sequencing reads used for the genome assembly are available in the NCBI SRA and can be accessed with Bioproject PRJNA607899 accession No. SRR11184190. The genome has been deposited to GenBank under Bioproject PRJNA607899.
The Illumina reads used for the linkage map are available in the NCBI SRA and can be accessed with Bioproject PRJNA608928 accession Nos. SRR11186917-SRR11187107.
Transcriptome 1 RNA-seq reads are available in NCBI GEO and can be accessed with accession No. GSE159376.
Transcriptome 2 RNA-seq reads are available in NCBI SRA and can be accessed with Bioproject PRJNA670126.
All supporting data and materials are available in the Giga-Science GigaDB database [74].